-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add pydantic models and parse_page for Spanish Wiktionary #392
Conversation
|
||
class BaseModelWrap(BaseModel): | ||
class Config: | ||
extra = "ignore" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If any keys are added to a pydantic class that are not defined in the class definition, they will be ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be ignored, many classes in the current code already has __slots__
. And LoggingExtraFieldsModel
will also not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite sure what's the significance of __slots__
in this context but let me explain my reasoning for configuring the base model to ignore extra fields.`
I think there are two goals in this context:
- The pydantic models should describe the output accurately.
- We want to capture as much data as possible.
In the German extractor, for example, I used the following pattern to extract the references to an example from a wikicode template:
for key, value in template_node.template_parameters.items():
if isinstance(key, str):
reference_data[key.lower()] = clean_node(wxr, {}, value)
In this example, the expected keys aren't defined anywhere in our code but rely implicitly on the arguments of that particular template. In this particular case, that makes a lot of sense since the reference template is quite detailed and exactly what we would like to capture.
However, when defining the json schema or using pydantic without the option extra = "ignore"
, there is always the risk that our output has keys that aren't documented anywhere. That might not be a problem in and of itself but certainly makes this data a lot less valuable. We also wouldn't satisfy goal 1.
With the option extra = "ignore"
, only keys that are defined in the pydantic model will be output and all others suppressed. This does satisfy goal 1.
However, we might still like to capture these keys (goal 2). For this reason, I am logging all extra fields that our code tries to assign but aren't defined in the model. This will enable us to make a decision on whether or not to include this key in our model.
In short: The combination of ignoring extra fields while logging these in a debug statement will force us to keep the pydantic models (and thus our output documentation) in sync with what we do in our code. I think this is the main benefit of using pydantic over manually maintaining json schemas (as currently done for the French extractor).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Config
class is deprecated, ConfigDict
or model class kwargs should be used: https://docs.pydantic.dev/latest/concepts/config/
The default value of extra
is ignore
, maybe the base model should use forbid
.
I read the doc again and find extra
only controls model initialization, it doesn't check undefined fields after the initialization. Since pydantic model classes are defined now, the extractor code shouldn't use dict like assignment now, they should use class attributes so the update
and get
methods could be removed. Update: I see... our old code like clean_code
uses .get()
, we should update clean_code
then. But if we allow arbitrary fields to be added that kind of miss the point of using pydantic...
Isn't the JSON schema file generated from the defined model classes? How could they contain fields created dynamically at runtime?
As for your German data example, I think we should not check the fields(or generate JSON schema) not created in our code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the thorough review and pointing my in the right direction!
The default value of extra is ignore, maybe the base model should use forbid.
With the option forbid, we would get an exception every time there is an extra field. This would mean that we need to add error handling every time we instantiate a pydantic model. I think that would be extremely cumbersome and I didn't find a good way to catch this exception in an abstract way... There is an interesting PR suggesting a fourth option warn which would be ideal for our case.
Isn't the JSON schema file generated from the defined model classes? How could they contain fields created dynamically at runtime?
You're right. The JSON schema file will only contain fields defined in the model classes.
As for your German data example, I think we should not check the fields(or generate JSON schema) not created in our code.
The main question is thus: Do we want that our JSON schemas strictly describe our output, i.e. the output contains only keys that are described in our JSON schemas? Or do we not mind if the output contains extra fields?
My preference and also the PR's is the first option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed this comment. If the code is initializing a pydantic model with undefined fields and get exception, that means the code and the model doesn't match and that's the whole point of validation, right? The only way we could get this exception is from our own mistake but not from the changes in Wiktionary.
OK, I guess it's fine to use the default ignore
setting, since the undefined data are not added.
No, I think it's better to use forbid
, the exception will be caught in page_handler()
and added to the error log.
I have added some code comments to highlight some of the features of pydantic. Don't hesitate to ask questions if things are unclear. |
I think it's fine as long as pydantic and the validation code don't get in the way and don't hurt performance. I'm kind worried about overriding Python std library's |
Hi there! It took longer than expected to come back to this but the next few days should see some progress. Let me comment on the outstanding pydantic questions first.
I am not overriding Python's I also had to add the methods |
I am now a bit stuck since there seems to be problem with Here the same example from amor:
In the Spanish Wiktionary, senses are defined as list items starting with I am not too familiar with the parser itself. If you could point me to the right spot where to fix this, I would be grateful. Thanks! Here bigger example for ayudar:
Senses always seem to start as list items with Intuitively, we would parse this to one big list with two list items It's probably not feasible to change the parsing logic to reflect the intuitive ordering and I will have to deal with this in code. Yet, I would appreciate any suggestion on how best to deal with this. The relevant documentation from Wiktionary is here: Page structure Many thanks in advance! |
Nodes after ":" are saved to the |
e5a4a0a
to
f6013b0
Compare
Thanks, @xxyzz. I haven't worked with the I have added gloss extraction now and so far am quite happy with using pydantics to build the Spanish extractor. If you agree with my approach, I would consider it ready to be merged. I will keep working on extracting more lexical data. Either in this PR or the next one. |
|
||
class BaseModelWrap(BaseModel): | ||
class Config: | ||
extra = "ignore" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Config
class is deprecated, ConfigDict
or model class kwargs should be used: https://docs.pydantic.dev/latest/concepts/config/
The default value of extra
is ignore
, maybe the base model should use forbid
.
I read the doc again and find extra
only controls model initialization, it doesn't check undefined fields after the initialization. Since pydantic model classes are defined now, the extractor code shouldn't use dict like assignment now, they should use class attributes so the update
and get
methods could be removed. Update: I see... our old code like clean_code
uses .get()
, we should update clean_code
then. But if we allow arbitrary fields to be added that kind of miss the point of using pydantic...
Isn't the JSON schema file generated from the defined model classes? How could they contain fields created dynamically at runtime?
As for your German data example, I think we should not check the fields(or generate JSON schema) not created in our code.
class LoggingExtraFieldsModel(BaseModelWrap): | ||
@model_validator(mode="before") | ||
def log_extra_fields(cls, values): | ||
all_allowed_field_names = {key for key in cls.__fields__.keys()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__fields__
is deprecated, model_fields_set
should be used.
Maybe you'd like to make this debug log disabled by default, because there will be lots of debug messages spamming the log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again! It's model_fields
that we need here. No idea, why they don't mention this property in the doc but it's clearly the replacement for __fields__
.
Maybe you'd like to make this debug log disabled by default, because there will be lots of debug messages spamming the log.
Interestingly, there shouldn't be a lot of spam from this logging call. At least how I see it used during development and maintaining of the extractors:
- If we set a field manually that isn't defined in a model, that's clearly an oversight and the model should be updated. To not miss this, the log can be useful and would disappear afterwards.
for key, value in template_node.template_parameters.items():
if isinstance(key, str):
reference_data[key.lower()] = clean_node(wxr, {}, value)
- In a case like above, the code suggests that the decision has been made to indeed capture all the fields. Than it's just a matter of running the extractor code once and update the models accordingly. Again, the debug messages would disappear afterwards.
- If only same keys from that
template_node
are supposed to be extracted, then this should be done explicitly in code. Again not that many/no debug messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undefined field shouldn't be allowed to be assigned. In your template parameters case, we could defined that field in its parent model as dictionary or Any
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to avoid not specifying the keys in a dictionary. IMO, the JSON schema is there to inform users what to expect. Extracting data into an unspecified dict is perhaps still better than not extracting but it's immensely yes useful to many applications without knowledge what's in the dict.
I feel like we are still not on the same page. Probably due to a lack of good examples. Perhaps looking at my WIP branch es-ahead
will help:
- Here I apply the template parameters case to parse a reference template. When I encounter an unknown key, I get a debug message and can update the pydantic model.
Obviously, there are other ways to achieve the same effect. I thought it intuitive to also delegate the task of notifying us about the ignored fields to pydantic, since (with the option extra="ignore
) it checks for unspecified fields anyway. But if you think it's a bad idea, I can get rid of LoggingExtraFieldsModel
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know you would break the template arguments that much... I don't know how pydantic model could fit in this case, maybe it shouldn't be used for dynamically created data because that code will break every time when the template is changed. I'm not sure... You could experiment with the code and see what is best for your usage cases and also easy to maintain.
Other extractors mostly use a ref
field for example source, maybe you could only define ref
field in model and copy template arguments to a dictionary field which won't be checked. But that's just my suggestion, you could use other implementations.
I think the most benefit we could get from pydantic is the validation, so random things won't be added from somewhere no one is watching. If arbitrary keys are still be able to be added we're no better than before. Some fields like saving template arguments should be allowed to do that but not other models/fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So even though the code looks like it's adding data to undefined field but in reality it is not. If you use pydantic this way, the code will be much longer and require more efforts to maintain, and therefore it's worse then our current defaultdict
approach. I think you have to make compromise to the JSON schema, I don't think you could use dynamic model creation. Don't use pydantic model for dynamically created data is the best solution IMO, otherwise you might need more complex code and it's not worth it.
And we already have a JSON data structure browser page on kaikki.org.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xxyzz I am very grateful for your comments and that you're really thinking this through alongside with me but I have to admit that right now I am also a bit lost what your exact worries are. So I am hesitant to change something in the code right now without knowing exactly what I am addressing.
Perhaps, I can try to address your points in writing and you could specify what needs to be addressed/changed in my proposal?
Just to make sure we are on the same page, here are the three different options how a pydantic model deals with extra values:
allow
- Allow any extra attributes.forbid
- Forbid any extra attributes.ignore
- Ignore any extra attributes.
With the default value of extra="ignore"
, your example does result in:
from pydantic import BaseModel, ConfigDict, model_validator
class MyBaseModel(BaseModel):
model_config = ConfigDict(validate_assignment=True)
username: str
m = MyBaseModel(username="a", password="b") # (A) no error, password gets ignored
getattr(m, "password") # (B) AttributeError: 'MyBaseModel' object has no attribute 'password'
setattr(m, "foo", "bar") # (C) pydantic_core._pydantic_core.ValidationError: 1 validation error for MyBaseModel
foo
print(getattr(m, "foo")) # (D) AttributeError: 'MyBaseModel' object has no attribute 'foo'
(A) Works exactly as intended:
password
gets ignored (we only get keys in our output that is described in our schema/classes)- We do not get an error and the code just continues executing.
- When inheriting from
class LoggingExtraFieldsModel(BaseModelWrap)
, the extra fieldpassword
would get logged and we could update the schema/class. - And we still get type validation on the defined field
username
.
(B) and (D) do not worry me:
- Instance attributes should be accessed as attributes:
m.username
- That also makes for a better developer experience since my IDE tells me which attributes exist.
- We could, of course, overwrite the
getattr
function to return None for undefined keys (likedefaultdict
) but I don't see a point in this. - If absolutely necessary, one could always use
try/except
.
(C) is in my eyes a bit problematic but solvable:
- Indeed, I realize now that my previous example of
add_template_params_to_reference()
does only work since I defined the dict-like setter using an exception:
class BaseModelWrap(BaseModel):
def __setitem__(self, item, value):
try:
setattr(self, item, value)
except ValidationError:
pass
- Without redefining
__setitem__
, I would need to usetry/catch
locally. - This would then mimic the behaviour of extra arguments on class instantiation: Extra arguments would just be ignored but logged, and an update of an existing key with the wrong type would simply not be executed. IMO, the behaviour we want.
I hope that clear up the technical question marks.
I will now try to answer your concerns.
I don't know how pydantic model could fit in this case, maybe it shouldn't be used for dynamically created data because that code will break every time when the template is changed.
No. As I see it, nothing would break since unknown keys don't throw errors (case (A)) or would need to be caught anway (case (C)). If a template would change and introduce a new key, e.g. {template|new=foo}
, we would simply not extract the new
key and potentially get a debug message.
Maybe you could only define ref field in model and copy template arguments to a dictionary field which won't be checked.
I prefer making the structure explicit but, yes, that is a valid option that we can flexibly choose. For example, if you were to adopt pydantic for the French extractor you could type ref
as an iterable type:
class Example(BaseClass):
ref: dict = None
or
class Example(BaseClass):
ref: Dict[str,str] = None
Or if you wanted to define some keys but not all you could overwrite the "extra" configuration:
class Reference(BaseClass):
model_config = ConfigDict(extra="allow")
title: str
class Example(BaseClass):
ref: Reference
ref = Reference(title="Hamlet",author="Shakespeare")
print(ref.model_dump())
# {"title":"Hamlet", "author":"Shakespeare"}
If you use pydantic this way, the code will be much longer and require more efforts to maintain, and therefore it's worse then our current defaultdict approach.
I fail to see it. It might a little bit more work initially but, IMO, is much more maintainable than manual JSON schemas.
Once some base classes are defined for the Spanish extractor, they can also be reused for other extractors via inheritance while still easily specifying where they differ, e.g.:
class PronunciationBase(BaseClass):
ipa: str
class SpanishPronunciation(PronunciationBase):
syllabic: List[str] = Field(default=[], description="Syllabic transcription")
class RussianPronunciation(PronunciationBase):
stress: str = Field(None, description="Word with stress markings")
Or am I missing something?
dynamic model creation
I wasn't aware of that. Thanks! It's a cool feature and we could theoretically use it to output a dynamic schema at the end of each extraction run. But it would be a hassle with multithreading and add very little benefit in the end...
And we already have a JSON data structure browser page on kaikki.org.
Very useful indeed! But as I understand it, this is more exploratory in its nature? It still doesn't replace a proper and detailed JSON schema.
Perhaps lastly, let me explain my use case: I want to convert the JSON Schema for each extractor to a Typescript type (either from the schema or from pydantic directly (pydantic-to-typescript
)) and then use these types to build a frontend, flexibly displaying lexical information extracted from wiktextract. Having these types during development will make things so much easier.
This is why I think our schema should be as detailed as possible. I am almost certain, I wouldn't use a field like ref: Dict[str,str] = None
since I had no idea what's in there. But in
class Reference(BaseClass):
title: str
I might at least use the title
field.
I hope that cleared things up.
Where do you see blocking issues at the moment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our main disagreement is our priority. Our priority is getting more data from the dump file, but your priority is getting JSON schema. Your code can't adapt to changes on Wiktionary, it needs constant maintenance to sync models with templates to add new template arguments. Because you're using pydantic model on data that shouldn't be validated.
You can get better JSON schema by running through the generated jsonl file like what kaikki.org did. And if you're trying to use dynamic model creation, that'd be just like scanning the jsonl file, but more meaningless validation and slower.
Back to my example code, all 4 errors are bugs in our Python code and shouldn't be ignored. Set forbid
, we get error log with one line code. You have to loose the validation on data you don't have control but have to add and harden the validation on data you do control. I think you're pursuing a goal that pydantic can't fulfill. But I haven't use pydantic before, I could be wrong.
I think you have a minor misunderstanding about the ref
field in other extractors, it is a str
, the plain text of the example sentence source. In the French extractor there are no template arguments dictionary but the English extractor has many. dict[str, str]
or Any
is exactly what I suggested before but you choose to use a pydantic model. It's even more easier if you don't translate the arguments to English but just copy them.
We could get this pr merged if you could update the typing, some are still using Dict
. I think process_pos_block()
could get index out of range
error because page_data
is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. Got it.
I think there is a way to have our cake and eat it when it comes to template arguments dictionaries. But let's deal with it when I get to it. If it won't work out then, I am fine with using dict[str,str]
for these kind of fields.
I think you have a minor misunderstanding about the ref field in other extractors, it is a str, the plain text of the example sentence source.
I am aware of this. At the very beginning when I started contributing here, someone told me that you do not want to enforce a strict schema for all extractors but rather see what makes sense for each Wiktionary edition. So while modelling the new extractors on the existing ones, I didn't think it would hurt to be more specific. In the case of ref
, why only extract the plain text if there is more structured data to have? In the German extractor, for example, I have stored the plain text with the raw_ref
within a ref
dictionary.
We could get this pr merged if you could update the typing, some are still using Dict. I think process_pos_block() could get index out of range error because page_data is empty.
Done. process_pos_block
doesn't get an index out of range error since it appends a copy of base_data
into page_data
before using index access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we're not using the same schema but we also don't want to much variations. In the case of quote template arguments, we could get the same result with more stable and clean code by copying them without validation and translation.
This work is a contribution to the EWOK project, which receives funding from LABEX ASLAN (ANR–10–LABX–0081) at the Université de Lyon, as part of the "Investissements d'Avenir" program initiated and overseen by the Agence Nationale de la Recherche (ANR) in France.
This work is a contribution to the EWOK project, which receives funding from LABEX ASLAN (ANR–10–LABX–0081) at the Université de Lyon, as part of the "Investissements d'Avenir" program initiated and overseen by the Agence Nationale de la Recherche (ANR) in France.
This work is a contribution to the EWOK project, which receives funding from LABEX ASLAN (ANR–10–LABX–0081) at the Université de Lyon, as part of the "Investissements d'Avenir" program initiated and overseen by the Agence Nationale de la Recherche (ANR) in France.
This work is a contribution to the EWOK project, which receives funding from LABEX ASLAN (ANR–10–LABX–0081) at the Université de Lyon, as part of the "Investissements d'Avenir" program initiated and overseen by the Agence Nationale de la Recherche (ANR) in France.
This work is a contribution to the EWOK project, which receives funding from LABEX ASLAN (ANR–10–LABX–0081) at the Université de Lyon, as part of the "Investissements d'Avenir" program initiated and overseen by the Agence Nationale de la Recherche (ANR) in France.
This work is a contribution to the EWOK project, which receives funding from LABEX ASLAN (ANR–10–LABX–0081) at the Université de Lyon, as part of the "Investissements d'Avenir" program initiated and overseen by the Agence Nationale de la Recherche (ANR) in France.
This work is a contribution to the EWOK project, which receives funding from LABEX ASLAN (ANR–10–LABX–0081) at the Université de Lyon, as part of the "Investissements d'Avenir" program initiated and overseen by the Agence Nationale de la Recherche (ANR) in France.
Thanks for your contribution! |
Thanks for your contribution from me too! |
Hi there,
I started to add basic support for the Spanish Wiktionary and attempted to integrate with pydantic for data validation and schema generation.
Why I think pydantic might be a good addition to the project:
The potential downsides:
At the moment, I have only setup the basic pydantic models and parse_page function for the Spanish Wiktionary. I will still need to build a little more to properly assess whether pydantic really fits the use case and is easy to work with.
For now, I will mark this as a draft and look forward to hear your feedback and answer potential questions you have regarding pydantic.